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Abstract 

Often, experiments, observations or simulations generate large numbers of snapshots of the 
configurations of complex many-particle systems. It is important to find methods of extracting 
useful information from these ensembles of snapshots in order to document the motion as the system 
evolves. The most interesting information is contained in the correlated motions of individual 
constituents rather than in their absolute motion. We present a statistical method to identify 
hierarchies of correlated motions from a series of two or more snapshot configurations. This method 
is demonstrated in a number of systems, including freely-jointed polymer chains, hard plastic 
spheres, water, and proteins. These concepts are implemented as TIMME, the Tool for Identifying 
Mobility in Macromolecular Ensembles. 
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I. INTRODUCTION 



Many situations of interest generate large numbers of configurations of complex, multi- 
component systems. Examples include molecular dynamics simulations [TJ, Nuclear Mag- 
netic Resonance (NMR) structure determination experiments [2j |3], tracking individuals in 
a crowd, grains of sand in a pile [I], propagation of fire in a forest (5j [6], magnetic vortices 
in a superconductor [7], rigid spheres in viscoelastic media[Hl HOI [HI H2], cars on the 
freeway [TBI [HJ HSl HE], or packets in a network[T71 US] . These systems have significant 
underlying structure on various time scales and often include hierarchies of self-organized 
structure [13 ED]- Identifying that structure poses a significant challenge in how to describe 
and quantify the kinematics of the evolving system. 

Visually, the variations between snapshots are often characterized by direct observation 
and inference. Such an approach is highly subjective and qualitative. Direct observation is 
also severely limited because it is impossible for the human brain to simultaneously track 
the motion of thousands of densely packed objects. For this reason, there is a significant 
need for unambiguous, systematic, objective, and quantitative analysis techniques. 

For many physical quantities, the most interesting information is not the overall move- 
ment of individual objects in an ensemble of snapshots, but rather the concerted motion 
involving clusters of two or more objects. This correlated motion between individual com- 
ponent objects reflects the underlying structure of the system. Factor analytic techniques 
such as Principal Component Analysis (PCA) can identify correlated motion by inferring 
a set of basis vectors of motions that contribute most to the variation between snapshots 
[2T] . The PCA basis vectors can be thought of like a basis set of eigenmodes for the system, 
which define an easily accessible subspace. Unfortunately, PCA substitutes one complexity 
for another. Instead of having to consider a series of snapshots, a researcher using PCA has 
to consider a series of 3 N dimensional basis vectors, where iVis the number of objects in the 
system (spheres, atoms, etc.). 

Experimentally, two-point microrheology uses the cross-correlated motions of pairs of 
beads embedded in a material and subsequently imaged via microscopy and tracked, with 
many beads per field of view, and relates the component along the direction of the vector 
separating the pair to the viscoelastic response of the material in between (HI [12]. This 
technique has been shown to match up with the bulk rheology in viscoelastic materials 
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where the ensemble-averaged autocorrelated motion of individual beads differs from the 
bulk response [12] . The disadvantages to this method are that it can only give the linear 
response, and that one has to average over an ensemble of pairs of beads. 

Other approaches attempt to reduce the complexity to something more tractable and tan- 
gible. One example is the Hingefind algorithm|22j. Hingefind assumes that large collections 
move as rigid bodies connected by hinges. A related approach involves 'model-free' methods, 
which decompose into rotation vectors acting on quasi-rigid bodies[23, 24]. In model-free 
approaches, a system is decomposed into a collection of subdomains, where each subdomain 
is treated externally as a rigid body under the assumption that the low-frequency modes 
can be approximated as rigid-body motions. Internal motion within the quasi-rigid domains 
can then be treated by normal mode analysis, or by some other approximation, to reduce 
significantly the computational complexity. 

Hingefind and model-free approaches identify the bodies and hinges or axes of rotation, 
and present a simple, easily understandable picture to the researcher. Unfortunately, this 
simplified picture can be quite unrealistic, as real motion is often characterized by both large 
and small scale rearrangements, rather than by simple hinge motion between rigid bodies. 

In this paper, we present an alternative geometric approach that shares some of the 
strengths of PCA, Hingefind, and model-free approaches. We find that it is possible to apply 
a simple statistical technique to identify a hierarchy of relatively rigid or co-moving clusters, 
i.e. collections of objects that tend to move in concert. This hierarchy is parameterized by a 
single cutoff distance. Any choice of the cutoff gives a decomposition into a set of co-moving 
clusters. At a sufficiently high cutoff, all objects in the system are included in a single 
co-moving cluster, which at this level reflects rigid body motions of the entire system. As 
the cutoff is decreased, the single large co-moving cluster breaks up into successively smaller 
clusters. At a cutoff of zero, only perfectly rigid clusters, with all internal pair distances 
locked, are identified as co-moving. 

The collection of rigid body translations and rotations of the clusters at every level can be 
thought of as forming a basis for the possible motions of the system, much like in PCA. This 
hierarchy of motions is relatively easy to conceptualize compared with the basis vectors of a 
PCA analysis. Given a sufficiently large sample size, this hierarchy reflects the underlying 
structure of the system. In cases where the Hingefind approximation of rigid bodies joined 
by hinges is reasonable for a system, then at some level in the hierarchy (corresponding to 
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a particular cutoff), the hinged rigid bodies will be identified as co-moving clusters. 

Our algorithm for identifying a hierarchy of clusters is presented in the next method 
section. A number of example systems are considered, in the following section, to illustrate 
the strengths and limitations of this approach. Finally, this algorithm is contrasted with a 
number of complementary analysis techniques. 

II. METHOD 

Consider a system of N rigid objects. Assume that for any pair of objects in the system, 
a and b, there exists a well-defined distance between them, r a f,(t) at a particular time. If 
these objects are rigidly braced relative to each other, then r a b(t) will be constant over the 
lifetime of the system, and the standard deviation of r ab (t), a a b, will be zero. 



Here a a b is a symmetric N x N matrix with a a b > for all a, b and a aa = 0. As the 
tethering between a and b becomes weaker, a a b will take on some nonzero value. Completely 
disconnected objects will have some larger value of a a b- Given these pair distance deviations 
and a distance cutoff a cut , it is possible to construct a co-moving cluster decomposition. We 
consider any pair of objects a and b to be co-moving up to the cutoff a c if a a b < o"c- 

In a large system, it is useful to consider an entire hierarchy of clusters. A two dimensional 
dilution plot is prepared by ordering the basis objects along the horizontal axis and the 
cutoff along the vertical axis (23] • m a dilution plot, individual clusters are shown as colored 
horizontal stripes at a given cutoff. Considering the entire dilution plot for a system is 
instructive because it can suggest which clusters are likely to be real and which occur because 
of statistical coincidence. These later clusters will usually be small and transient and so 
rather easy to identify. 

Co-moving clusters are defined such that they satisfy an equivalence class relation. Given 
three objects a, b and c, if a a b < o c and o^c < c c > then a and c are considered to be members 
of the same cluster, even if <j ac > a c . With this definition, a long, slightly flexible rod made 
of several basis objects will be identified as a single co-moving cluster at an appropriate 




(1) 
(2) 
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choice of cutoff, as long as the motion between nearby objects does not exceed the cutoff 
and even if the ends of the rod individually move enough relative to one another to be placed 
in separate co-moving clusters in the absence of the connecting group. 

Care must be taken when applying this technique to highly constrained structures such 
as molecules. Because molecules are covalently bonded networks of atoms, simply treating 
each individual atom as an object will not produce reasonable results. The distance between 
a pair of covalently bonded atoms is fixed, up to thermal fluctuations, and remains fixed 
as the system evolves. Within a molecule, it is possible to follow a sequence of any pair 
of covalently bonded atoms. For this reason, the entire molecule would appear to be a co- 
moving cluster by the equivalence class property. The only reason the molecule might appear 
to decompose into components would be due to small random differences in the standard 
deviations of individual bond lengths, and this would only be detected with a very small 
cutoff. 

Rather than consider individual atoms, one can consider an atom and its covalently 
bonded neighbors as the most basic intrinsically rigid building block. Computing the value of 

over all combinations of pair distances between these objects yields values that accurately 
reflect the actual relative motion. Although this approach is effective, it is somewhat more 
complicated and requires additional calculations to analyze a given system. 

More generally, the user may select appropriate intrinsically rigid objects as the basic 
elements. For hard shell spheres, the spheres themselves would be a natural choice. For 
bonded atoms in molecules, the natural choice would be the object defined by an atom and 
its bonds but excluding its bonded neighbors. 

The above concepts are implemented as TIMME, a Tool for Identifying Mobility in Macro- 
molecular Ensembles. An outline of the TIMME algorithm is given by the following list: 

1. Select an appropriate intrinsically rigid object to act as the fundamental unit for the 
analysis of the system. 

2. Iterate through each pair of objects a and b in the system. 

3. Compute a a b for each pair of objects. 

4. Sort the values of <r a b in increasing order, and join together the clusters containing 
sites a and b into a composite co-moving cluster. This has the effect of including all 
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sites with Uij < <j a b = ®c 

5. Accumulate statistics for the co-moving clusters at each a a b- In particular, compute 
for each a c the fraction of sites contained in co-moving clusters containing 10 or more 
sites. 

6. Assign each site a unique label. 

For a system of A-obects, step [2] scales as 0(N 2 ). There are several obvious simplifications 
to the TIMME algorithm. In particular, considering only spatially adjacent pairs rather than 
all pairs reduces this step to one of O(N), rather than 0(N 2 ). 

III. EXAMPLES 

The TIMME algorithm makes no assumptions about the source of input data. It can 
be applied equally well to a set of snapshot coordinates from NMR structure determination 
experiments, confocal microscopy direct visualization experiments, molecular dynamics sim- 
ulations, people within a crowd, or cars on a road. Although all of the examples presented 
here are confined to three dimensional coordinates, this approach can also be used to find 
clustering hierarchies for arbitrary data embedded in any high-dimensinal space. 

Several example systems follow. Each was selected to demonstrate specific properties of 
the TIMME algorithm and features of the results it generates. The examples covered in the 
next sections are: A. the freely jointed polymer chain; B. hard spheres; C. water; and D. 
proteins. 

A. Freely-Jointed Polymer Chain 

As a model of a linear polymer, we consider a freely-jointed chain in 3-dimensions [26J. 
Each monomer subunit is modeled as a rod of length a. Linked pairs of monomers are linked 
by ball-and-socket hinges, all possible angles are sampled with equal probability. Overlaps 
between individual chain elements are allowed. 

While it is tempting to consider the joints as the fundamental objects for this system, 
such a choice is poor because any pair of adjacent joints will always be separated by a 
constant length. If the joints were used as the fundamental objects, then a TIMME analysis 
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would conclude that the entire polymer chain was a single co-moving cluster for any choice of 
cutoff, a c . A better choice is to use the links (bonds) between joints (atoms) as fundamental 
objects for the freely-jointed chain. 
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FIG. 1: Two realizations of a freely-jointed chains with 6 joints and 5 links. Each link has length 
a. The angle between links 2, and 3 is 9\ in the first realization and 62 in the second realization. 
The corresponding pair distance between neighboring links (2 and 4, in this case) is given by l\ in 
the first realization and I2 in the second. The relationship between links 4, and 5 remains constant 
(rigid) between the two structures, as indicated by the solids links, while the rest of the structure 
is more flexible 

The separation between the two next-nearest neighbor joints, as shown in Fig. [TJ is 
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The probability p(l) of measuring a value /, is 

p{l) =^-j d9sin(9)5 (l - a v / 2(l - cos(0))) 
I 

~ 2^2' 

For two random snapshot realizations, the standard deviation of the length, /, from its 
(0, is 
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FIG. 2: Diagrams of clusters of links in freely jointed chains. Each vertex represents a joint. Each 
line segment represents a linkage. Dark segments represent clusters of linkages. In general, it can 



be shown that the fraction of sites in a cluster of size r, P{r) is r(l — p c ) p r c , as in Eq. (15 ) 



At some cutoff, a c , the probability of two adjacent linkages being joined is p c = p{a < 
a c ). The probability of two adjacent linkages being not joined is 1 — p c . For a sufficiently 
long chain, any cluster must begin and end with disconnected links with a corresponding 
probability of (1 — p c ) 2 . Every vertex joining two links within a cluster has probability p c . 
There are r links in a cluster of size r, as shown for r from 1 to 3 in Fig. [2j Therefore, the 
probability P(r) of a site being in a cluster of r links is 



P(r) = r(l -p c ) p t 
The expectation value or r, (r), is given by 

oo 

(r) = rP(', 



2„r-l 



(15) 



r=l 
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(16) 

(17) 
(18) 



For small values of p c (corresponding to a small cutoff, a c ), (r) = 1, corresponding to single- 
site clusters, (r) increases monotonically with p c , and diverges as p c approaches 1, where all 
sites will be members of a single cluster. 

These ideas can be extended to iV > 2 conformations, but we have been unable to find a 



general closed form expression for p(cr), except for iV = 2 as given in Eq. (11 ). However, it 
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is possible to calculate the first few moments [27] of p(a) defined via the second moment, 



and the variance 



<- 4 >-<- 2 > 2 -^fi-4V+^(i-^)- 



which reduce to the results that can be obtained directly by integration from Eq. ( 11 ) when 
N = 2. Similar results can be obtained for any distribution that would replace the freely 
rotating chain given in Eq. (Ej). These results have some important implications that are 
independent of this particular model. The first is that the second moment of the distribution 
is only very weakly dependent on the number of conformations N used, which is important as 
it means that the sampling intervals are not important as long as the cluster stays together 
and remains intact. This is a quite general result as is intuitively clear. The second is 



that the variance, Eq. (20), goes to zero as iV — > oo, meaning that the distribution, p(cr), 
becomes a delta function, and that all the clusters are of size zero for a < a c and there is 
a single large cluster for a > a c , where for the freely rotating chain, a c /a = a/2/9 = 0.471, 



from Eq. (19). This means that Fig. [4] sharpens up to a step function at a c /a = 0.471, as 
the number of conformations of the freely rotating chain increases to infinity, which we have 
confirmed with simulations analyzed with TIMME. This result is special to the case where 
all elements are equivalent, and would not be true for the models considered in the next 
sections of this paper, which is of most interest. Quite generally, there is little dependence 
on the number of conformations, N, used in the sampling, provided, of course, that clusters 
do actually stay together at a particular cutoff, a c . 

Two independent 10,000-link freely-jointed chain realizations were created by randomly 
assigning each link an orientation from a uniform distribution over the surface of a sphere 
of radius a. Only nearest neighbors were considered during the TIMME analysis. There 



was good correspondence between the model, Eq. (15) and the simulation, as can be seen 
in Fig. [3] 
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FIG. 3: The fraction of sites in a cluster of size r ranging from 1 to 5 versus cr c /a, where a is the 



length of each chain link, from theory (dashed; see Eq. ( 15 )) and from a simulated 10,000 link long 
freely-jointed chain (solid). 



It can be useful to lump the largest clusters together via 

oo 

P(r >R) = J^P(r) (21) 

r=R 

R-l 

= 1-J>(r). (22) 

where P(r > R) represents the fraction of sites contained in a cluster of size greater than 
or equal to R. When all associations are by chance alone, this fraction typically takes a 
sigmmoidal form, as in Fig. |4j 
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FIG. 4: The fraction of sites in clusters of 10 or more plotted versus a c /a, where a is the length 



of each chain link, from theory (dashed) in Eq. (22), and from a simulated 10,000 link long 
freely-jointed chain (solid). 



B. Hard Spheres with Added Weak Attraction 

An interesting mesoscopic physical model consists of small polymethyl methacrylate 
(PMMA) spheres contained between two glass microscope slides and driven by an added 
interparticle attractive interaction of strength at contact of order ksT. This system can 
serve as a model for metallic glasses [281 EH] and other random sphere packs [30, |3TJ 132] . 

The influence of gravity is minimized by immersing the spheres in a density-matched 
fluid. The full three-dimensional coordinates of each sphere of a collection of hundreds to 
thousands of spheres within an imaging volume are observed using confocal microscopy in 
conjunction with shape recognition algorithms. The trajectory of individual spheres through 
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time can then be tracked by acquiring a succession of closely spaced snapshots at equally- 
spaced time intervals chosen to access the relevant timescales for capturing the motions 

[2S1I221E3E3]. 

Because they are weakly bonded, such systems of nearly-identical spheres organize into 
relatively rigid network structures and relatively free mobile clusters of one or more spheres, 
on timescales that are not too long. At long times, a bimodal distribution of motions was 
detected in the experimental analysis, corresponding to two populations of relatively jammed 
and relatively free spheres (23] • It is difficult to visually identify and distinguish between 
these two types of structures in a homogenous, packed population. These transiently jammed 
features are readily apparent in a TIMME analysis. 

TIMME analysis was performed on a data set consisting of an ensemble of 46 snapshots 
of 991 PMMA spheres [30, 34j. To ensure that the identified clusters are continuous in at 
least one snapshot, only pairs of spheres that were in contact in one or more frames of the 
confocal microscopy data were considered in the analysis. As in the freely-jointed chain, the 
fraction of spheres in clusters of 10 or more spheres grows roughly sigmoidally with cutoff, 
as seen in Fig. |5j 
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FIG. 5: The fraction of PMMA spheres contained within co-moving clusters containing 10 or more 
spheres, as a function of scaled cutoff (cr c /o, where a = 1.3/xm, the diameter of the spheres). 
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FIG. 6: Dilution of the clusters identified by TIMME for 46 snapshots of a system comprising 991 
PMMA spheres observed by confocal microscopy. Analysis was performed using a sparse matrix of 
Oab treating only physically adjacent pairs as nonzero. Co-moving clusters consisting of 3 or more 
spheres are shown in various colors, with the largest in dark-blue. Clusters containing fewer than 
3 spheres are suppressed in this plot for clarity. The dashed line shows the cutoff used in Fig. [7] 
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FIG. 7: Top-down view of a single representative snapshot from a collection of 46 snapshots of 
a system of 991 PMMA spheres, showing a kissing network. Co-moving clusters were computed 
at a cutoff of a c /a = 0.146 (the dashed line in Fig. [6]). Clusters consisting of 3 or more spheres 
are shown as thick sticks between individual spheres in various colors, with the largest, or the 'co- 
moving core', in dark-blue. Spheres that are not members of larger co-moving clusters are small 
grey balls shown here at approximately 1/2 their actual size to allow the viewer to see past them. 

In many systems, such as these PMMA spheres, there is no natural ordering for the 
basis objects. In such systems, it is convenient to define a 'site label' based on TIMME 
dilution. This site label is chosen such that at all cutoffs, any particular co-moving cluster is 
continuous. Additionally, as the cutoff decreases and clusters bifurcate, sites in the largest 
cluster are assigned the smaller site labels so that in dilution plots the larger clusters will 
tend to segregate to the left and smaller clusters to the right. Although this condition is 
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imposed at each breaking point in the dilution, it is possible that a cluster that was assigned 
smaller site labels will later break up and become smaller than a cluster with larger site 
labels. 

These site labels are used in the dilution plot shown in Fig. |6j An inflection in the growth 
of the core at a cutoff of approximately a c /a = 0.147 reflects a transition between detecting 
physically jammed clusters and random clusters. A representative snapshot realization is 
shown in Fig. [7] at a cutoff just below the inflection point. 



C. Water 



A simulated droplet of 1000 explicit water molecules was equilibrated at room temper- 
ature for 0.8 ns. This droplet was then subjected to 0.2 ns molecular dynamics simulation 
using NAMD[1\ at standard temperature (273.15 Kelvin) and pressure (1 atmosphere). 

In real water, individual molecules can auto-ionize into hydroxide and hydronium ions. 
Such behavior is not present this molecular dynamics run. Thus, the simulated drop of 
pure water consisted of identical molecules. Like the system of PMMA spheres, there is 
the potential for geometric jamming creating another source for co-moving structures on 
some timescale. In addition to the hard-shell repulsion, water molecules can form transient 
hydrogen bonds. These bonds allow groups of two or more water molecules to form short- 
lived structures such as tetrahedra, chains, and rings [351 ESS 137] . 

This combination of interactions creates a more complicated hierarchy of clusters with a 
large number of intrinsic time and length scales [381 139], as revealed by TIMME analysis the 
molecular dynamics trajectory. There is sufficient disorder in the system that the fraction 
of water molecules in a cluster of size 10 molecules or more follows a sigmoidal curve, as can 
be seen in Fig. [8j Clusters of associated water molecules are visible as colored stalactites 
dangling from the core in the dilution plot shown in Fig. 19} A representative snapshot of 



the water simulation is shown in Fig. 10 
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FIG. 8: The fraction of atoms in co-moving clusters of 10 or more water molecules, as identified by 
TIM ME, in a 0.2 ns NAMD[1_ molecular dynamics simulation of 1000 water molecules in vacuum 
as a function of cutoff, a c , normalized by the oxygen-hydrogen bond length (a = 0. 
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FIG. 9: Dilution plot of the clusters identified by TIMME for a 0.2 ns MD trajectory of a liquid 
water drop consisting of 1000 molecules at standard temperature and pressure. The dashed line 



shows the cutoff used in Fig. 10 
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FIG. 10: A single snapshot of a simulated drop of 1000 water molecules, in vacuum. Coloring is by 
membership in individual co-moving clusters, as identified by TIMME at a cutoff of <r c /a = 0.54 
(the dashed line in Fig. [9j. Those co-moving clusters spanning more than one water molecule are 
shown as thick colored sticks, while those spanning only a single water molecule are shown as thin 
grey sticks. At this cutoff, the largest cluster spans 17 molecules. 



D. Proteins 

Proteins are linear polymers of amino acids. Their structure depends on their sequence, 
subject to the environment in which they are expressed [4"0] |4"T| [321 SSI S3]- A typical protein 
has a relatively rigid internal scaffolding that acts as a support to hold more flexible domains 
into specific relative positions and orientations. Fexible domains can serve to facilitate 
protein-protein interactions, to bind substrates, or to serve as a hinge between two or more 
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relatively rigid scaffolds. 

Understanding which domains are more rigid and which are more flexible can give signif- 
icant insight into how a protein performs its functions as a molecular machine. In addition 
to the intrinsic interest of knowing how a protein works, such understanding can be of prac- 
tical use for suggesting possible targets for drug design to facilitate or inhibit a protein's 
performance US SSI S3 HH HH] • 

A number of experimental techniques are used to determine protein structures. Two of the 
most common are X-ray diffraction [501 EI] and Nuclear Magnetic Resonance (\.\IK) 52. '>'■) . 

X-ray diffraction techniques diffract a beam of X-rays from highly purified crystals of 
a protein. Solution of the phases of the resulting diffraction pattern gives a reciprocal 
space representation of electron density in the protein. Computation of the inverse Fourier 
transform of that reciprocal space representation provides the real-space electron density, 
from which a representative structure for the protein can be inferred. 

NMR techniques measure internal distance and orientation constraints within an ensem- 
ble of structures. By imposing these constraints on a molecular dynamics simulation of the 
protein, an ensemble of structures consistent with the constraints can be generated. As- 
suming that a sufficiently large number of constraints have been measured, the structures 
reported from an NMR experiment will reflect some of the more populated states in the 
conformational space of the protein. 

Rigid clusters can be inferred from static structures using a graph theoretic struc- 
ture analysis system, called 'Floppy Inclusions and Rigid Substructure Topography', or 
FIRST\5Q [55]. FIRST balances constraints and degrees of freedom to determine which 
regions of a molecule will be rigid and which will be flexible, under a specific set of con- 
straints. The constraints used for this analysis include covalent bonds, hydrophobic tethers, 
and hydrogen bonds with strength greater than a user-specified cutoff. Alternatively, Molec- 
ular Dynamics simulations can be performed to explore which domains of a static protein 
structure are relatively rigid and which are relatively flexible. 

TIMME was originally developed by us to aid in the analysis of ensembles of protein 
conformations, and to provide an alternative to FIRST when more than a single confor- 
mation was available. Conceptually, FIRST rigid clusters correspond roughly to TIMME 
co-moving clusters at some cutoff. By selecting an appropriate cutoff, static structures, such 
as those reported from X-ray crystallography experiments, can be compared to ensembles 
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of two or more structures, such as those reported from Nuclear Magnetic Resonance (NMR) 
structure determination experiments. 

When analyzing molecular frameworks, it is useful to consider only nearby pairs of bonded 
atoms to avoid identifying discontinuous clusters and to improve performance. Adjacent 
bonded atoms must be excluded because otherwise, a is will only be a measure of thermal 
fluctuations in bond lengths and angles. Also, if no thermal bond length and angle fluctua- 
tions were present in an ensemble, then any bonded molecule would be identified as a single 
co-moving cluster at any cutoff. Relative motion between third-nearest neighbor atoms 
reflects bond torsion, possibly restricted by distant steric and charge constraints. Even con- 
sidering this, a well folded structure will have limited local motion and so the values of a 
will typically be small compared with the system. 



1. Barnase 



Barnase is a 110-residue water-soluble protein that is produced and secreted by Bacillus 
amyloliquefaciens. By degrading RNA, barnase can stop protein synthesis and eventually 
kill neighboring cells. To protect itself from the action of barnase, B amyloliquefaciens 
coexpresses an inhibitor called barnstar|56j. Within a cell expressing barnase, barnase is 
very tightly bound to barnstar[57j. For this reason, barnase is commonly used in studies of 
protein-protein interactions, and protein dynamics. 

As an example of the correspondence between rigid clusters identified from static struc- 
tures and those from dynamic structures, a FIRST analysis was performed for a static X-ray 
structure of barnase (pdb ID: la2p|58]). and compared with TIMME analysis of a corre- 
sponding NMR ensemble (pdb ID: lbnr[2]), and of a 0.1 ns molecular dynamics simulation 
starting from the X-ray structure. 

We found that the best agreement between FIRST and TIMME results occurred with 
respective cutoffs of approximately -0.05 kcal/mol and o c ja = 0.04. Unlike the freely-jointed 
chain, water, and PMMA sphere systems, the fraction of atoms in the largest comoving 



cluster of barnase, shown in Fig. 11, is non-sigmoidal, reflecting greater structure imposed 
by the covalent bond network and non-covalent interactions. Although they are not identical, 



a casual inspection of Fig. 13 reveals significant overlap between the rigid clusters identified 
by FIRST, and the co-moving clusters identified by TIMME for the NMR and a 0.1ns 
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NAMDpQ MD ensembles. 
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FIG. 11: The fraction of main-chain a— carbon atoms in co-moving clusters of 10 or more main- 
chain a— carbon atoms in barnase (pdb ID: lbnr|2 ]) computed by TIMME. The x-axis shows the 
cutoff, a c scaled by the carbon-carbon bond length (a = 1.54 A). 
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FIG. 12: Dilution of the clusters identified by TIMME for the main-chain a— carbon atoms, ordered 
by residue number, from an NMR ensemble of barnase (pdb ID: lbnr[2]). The side-chain atoms 
which are suppressed for clarity have higher a values and tend to join the clusters at higher a c . The 
y-axis shows the cutoff, a c scaled by the carbon-carbon bond length (a = 1.54 A). The horizontal 



dashed line shows the value of the cutoff used in Fig. 13 



Proteins have an intrinsic primary sequence, which provide a natural site label for dilution 



plots, such as that shown in Fig. 12 With appropriate choice of cutoff, the rigid clusters 
identified by FIRST for a static X-ray structure (pdb ID: lbnr|2j) corresponds well with 
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those identified by TIMME for an NMR ensemble (pdb ID: lbnr[2]) and a 0.1 ns NAMD\S\ 
molecular dynamics simulation of the X-ray structure. This correspondance can be seen in 



Fie. 13 



FIRST NMR + TIMME Ml)+ TIMME 



(a) (b) (c) 

FIG. 13: Analysis of the bacterial ribonuclease, barnase, by FIRST and TIMME. Panel (A) shows 
a rigid cluster decomposition of an X-ray crystallography structure (pdb ID: la2p, chain A|58]). at 
an energy cutoff of -0.048 kcal/mol. Individual rigid clusters are colored in order of their size, with 
the largest cluster, or 'rigid core', in dark blue. Panel (b) shows a co-moving cluster decomposition 
of an NMR ensemble in aqueous solution (pdb ID: lbnr[2j), at a c /a = 0.04 (dashed line in Fig. 



12) applied to a single snapshot for visualization. As in panel (a), individual co-moving clusters 
are colored to match the corresponding clusters in panel (a). Panel (c) shows a co-moving cluster 
decomposition of an MD simulation based on an X-ray crystallography structure (pdb ID: la2p, 
chain A[58]), at a cutoff, a c /a = 0.068, with clusters colored to match corresponding clusters in 
panel (a). 



2. Adenylate Kinase 

Adenylate Kinase (ADK) is a ubiquitous enzyme thatcatalyzes formation of Adenylate 
Triphosphate (ATP) and Adenylate Monophosphate (AMP) from two molecules of Adeny- 
late Diphosphate (ADP), as well as the reverse reaction, 2 ADP < — > AMP + ATP. 

ATP is a nucleotide - one of the bases incorporated into DNA and RNA. It is also a 
common cellular energy currency. It is used by many different proteins to drive endothermic 
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reactions. Energy is stored by the bonding of phosphate to AMP or ADP in a dehydration 
reaction. By coupling hydrolysis of ATP to ADP to an endothermic reaction, enzymes within 
biological cells are able to drive energetically unfavorable reactions [5 SJ EDI EI]- In addition to 
the role in cellular energetics, phosphate groups from ATP are used in regulation. Enzymes 
called protein kinases cleave phosphate from ATP and covalently bond the phosphate to 
proteins. Phosphorylation acts as a switch by altering the activity of a protein. A common 
example involves transduction by cascades of protein kinases of a signal from a cellular 
sensor to the nucleus of a cell in order to up- or down-regulate expression of one or more 
proteins [521 ES El E2] • 

ADK ensures that the concentrations of ATP, ADP and AMP within a cell quickly reach 
equilibrium, to facilitate energy production and use[3l ESI 16^ - 

FIRST analysis was performed for a static snapshot from an NMR ensemble for ADK 
(pdb ID: lp4s model: 1 [3]), and compared with TIMME analysis of the complete NMR 
ensemble (pdb ID: lp4s[3j). Cutoff distances were selected to maximize the overlap between 
FIRST rigid clusters and TIMME co-moving clusters. FIRST analysis was performed with 
hydrogen bonds stronger than -0.185 kcal/mol included in the constraint counting. TIMME 
analysis was performed with a c /a = 0.09, where a is the carbon-carbon bond length, 1.54 



A. At these cutoffs, there is significant overlap, as can be seen in Fig. 14 
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(a) (b) 

FIG. 14: Analysis of Adenylate Kinase by FIRST and TIMME. Panel (a) shows a rigid cluster 
decomposition of a model from an NMR ensemble in aqueous solution (pdb ID: lp4s [3]), at a 
hydrogen-bond energy cutoff of -0.185 kcal/mol. Individual rigid clusters are colored in order of 
their size, with the largest cluster, or the 'rigid core', in dark blue. Panel (b) shows a co-moving 
cluster decomposition of an NMR ensemble of ADK in aqueous solution (pdb ID: lp4s [3i), at 
a c /a = 0.09, applied to a single snapshot for visualization. As in panel (a), individual co-moving 
clusters are colored according to their size in the leftmost figure, with the largest co-moving cluster, 
or the 'co-moving core', colored in dark blue. 

IV. DISCUSSION 

Understanding and documenting the evolution of the particles in inhomogenous many- 
particle systems is the subject of this article. Computer simulations and experimental obser- 
vation of these large systems go hand-in-hand in studying these large systems. The results 
of such simulations or experiments include such a large quantity of information that data 
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mining techniques are essential for understanding and revealing the underlying properties 
of such systems. A number of approaches have been developed to that end, including factor 
analytic and clustering techniques. 

A new method for identifying clusters of sites that move in concert is introduced in this 
paper. This approach, implemented as the Tool for Identification of Mobility in Macro- 
molecular Ensembles (TIMME), combines a number of the strengths of factor analysis and 
hierarchical classification. 

The TIMME algorithm has been applied to a number of sample test cases. First, the 
method was demonstrated on a freely-jointed polymer chain with no other intrinsic struc- 
ture, where an analytic solution was available for comarison. Second, PMMA spheres were 
analyzed using data obtained via confocal microscopy, Third, the technique was applied to a 
Molecular Dynamics simulation of a drop of water, which exhibited a complicated hierarchy 
of organization due to hydrogen bonding and other weak interactions between different adja- 
cent water molecules. Finally, two proteins in aqueous solution observed by NMR structure 
determination experiments have been analyzed. In both protein cases, the observed hierar- 
chical structure we compute corresponds to known properties of the proteins, and also to 
the hierarchical structure determined from static X-ray structures with the software FIRST. 

Incomplete sampling is arguably the most severe limitation of the TIMME algorithm. 
Because the algorithm requires an estimate of a a b for each pair of sites a and b, it is sensitive 
to under-sampling, bias and errors. This limitation is not unique to TIMME, but it is a 
fundamental problem for any analysis approach based on incomplete information |68j. 

One complementary approach to analyzing dynamics of many-bodied systems is to per- 
form a Principal Components Analysis (PCA) on a representative collection of 3n dimen- 
sional snapshot states. PCA is a linear transform that identifies the specific directions of 
motion within the system that contribute most to the system's overall motion. A PCA can 
be thought of as similar to a Fourier transform, except that the basis vectors are empirically 
derived to satisfy certain properties. In particular, the projection of all snapshot states 
onto the first PCA basis vector has the largest variance of any linear projection; with the 
projection onto the second PCA basis vector having the largest variance of any other linear 
projection orthogonal to the first, and so on. 

There are a number of algorithms for performing a PCA. One common approach begins 
by computing a covariance matrix for the 3n dimensional snapshot states. If x a represents 
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the coordinate vector of a in a basis, and x_ b the coordinates of 6, then the covariance of a 
and b is 

cov ab = ((x a - (x a )) ■ (x b - (x b ))) (23) 

where the expectation value (x) is shorthand for the arithmetic mean value of x. 

The covariance of a and b indicates how much a and b vary together. If the positions of a 
and b are correlated, then cov ab > 0. If they are anti-correlated, then cov ab < 0, and if they 
are uncorrelated, then cov ab = 0. In this way, cov ab can be used to identify the correlated 
motions within an ensemble. 

The bilinear, symmetric 3iV x 3N matrix cov ab is independent of uniform global trans- 
lations and rotations of all snapshots, but not independent of relative translations and 
rotations between snapshots. 

Diagonalizing the covariance matrix yields a set of basis vectors for the system. Sorted 
in decreasing order of the corresponding diagonal elements, these vectors correspond to the 
directions of motion with decreasingly large contribution to the variance. These basis vectors 
are called principal components and hence the approach is called a Principal Components 
Analysis (PC A) [2D EH ED] . 

Most often, this process is implemented by performing a Singular Value Decomposition 
(SVD) on the covariance matrix. This matrix factorization can roughly be thought of as 
a generalized eigenvector /eigenvalue decomposition. In an SVD, a m x n matrix M is 
factorized in the form 

M = £ i ¥} (24) 

where U_ is an m x n matrix of basis vectors for the output range, E is an n x n matrix with 
the singular values as the only nonzero entries down the diagonal, and V} is the conjugate 
transpose of an n x n matrix of basis vectors for the input domain [71]. 

Decomposition into co-moving clusters in TIMME at a given cutoff corresponds roughly 
to the model-free approach of decomposing NMA or PCA covariance matrices into a block- 
diagonal form [23]. The individual blocks in the matrix correspond to distinct quasi- rigid 
domains, which can be thought of as co-moving clusters. In model-free methods, a fixed 
set of co-moving clusters are chosen initially, and used throughout the analysis. TIMME 
extends this approach to decomposing an entire system into successively smaller quasi-rigid 



30 



domains ranging from the entire system as a whole to the smallest indivisible rigid building 
blocks. 

V. CONCLUSIONS 

The algorithm TIMME developed in this paper, is an objective and systematic approach 
to analyzing ensembles of snapshots of a dynamical system. Co-moving cluster decomposi- 
tions based on pair-distance statistics provide a useful way to understand detailed features 
in such systems that are not readily visible to an unaided observer. To do this, it is necessary 
to have two or more snapshots of particle configurations are available for analysis. 

Because the algorithm relies on internal pair distances, the results are independent of 
choice of coordinates, and system-wide translations and rotations between snapshot config- 
urations. The algorithm makes no assumptions about the number or type of objects being 
analyzed, and can be applied to any system with two or more snapshot realizations with 
a metric that allows a pair distance to be measured and monitored. Although all of the 
examples shown here were embedded in 3-dimensional space and modeled physical objects, 
this method could be applied 2-dimensional and also to high-dimensional abstract data sets 
to identify a hierarchy of clusters. 

In order to apply the algorithm, individual particles must be enumerated so that they 
can be tracked. This can be accomplished in two ways. In a protein, the atoms are naturally 
numbered by their position along the polypeptide chain. In a collection of identical objects 
like PMMA spheres, the situation is more complex, and the snapshots must be taken at 
sufficiently short time intervals such that no sphere has moved further than about a radius 
so that it can be continuously tracked unambiguously. 

Finally we note that rigid cluster do not have to be contiguous in three and higher 
dimensions [721 173] . This is a very counter-intuitive situation where mutually rigid and 
con-contiguous pieces can form a single rigid cluster, while being embedded in a fluid. The 
present algorithm could be used to identify such (comparatively rare) situations, by moni- 
toring all pairs and not just close neighbors, in order to build up the rigid clusters. To our 
knowledge, no such situations have ever been identified in the laboratory. 
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APPENDIX A: PROGRAM AVAILABILITY 



TIMME has been implemented as a module of FIRST. The combination FIRST/TIMME 



is available gratis to academic users via FlexWeb (http:/ /flexweb. asu.edu/software/timme). 



The source code can be downloaded as part of the FIRST suite or the program can be run 
interactively online using FlexWeb. Note that the graphics used in Figs. ([7J [ToJ [l3j and 14) 
were generated with PyMol [74] . 
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